Prague Dependency Treebank: Restoration of Deletions

Authors

  • Eva Hajičová
  • Ivana Kruijff-Korbayová
  • Petr Sgall
Abstract

The use of the treebank as a resource for linguistic research has led us to look for an annotation scheme representing not only surface syntactic information (in ‘analytic trees’, ATS) but also the underlying syntactic structure of sentences and at least some aspects of intersentential links (in ‘tectogrammatical tree structures’, TGTS). We focus in this paper on some of the issues of the transduction of ATSs into TGTSs.

1 Two steps of syntactic tagging in PDT

In the Prague Dependency Treebank (PDT) project, the structure of sentences is made explicit by means of two steps of syntactic tagging resulting in: (i) ‘analytic’ tree structures (ATSs), in which every word form and punctuation mark is represented as a node of the tree, and the edges of the tree correspond to (surface) syntactic dependency relations; and (ii) tectogrammatical tree structures (TGTSs), which correspond to underlying sentence representations and have the shape of dependency trees with the verb as the root of the tree.¹

In TGTSs, the functional (synsemantic) words (such as prepositions, auxiliaries, and subordinating conjunctions) as well as punctuation marks are in principle not represented by nodes of their own; their functions are captured as parts of complex tags of the nodes standing for autosemantic (content) words.

Surface deletions are ‘restored’ in TGTSs. The syntactic information which is absent in the surface (morphemic) shape of the sentence is introduced, at least for the time being, in the manual phase of the transduction procedure ([Hajičová et al. 1998]), which translates ATSs to TGTSs in a ‘user-friendly’ environment. Every added (restored) node gets the index ELEX (if its antecedent is an expanded head node) or ELID (otherwise). The added nodes always depend on their governors from the left-hand side, except for certain cases in coordinated constructions (cf. (2) below).

* The work reported on in this paper has been supported by the grant of the Czech Ministry of Education VS 96/151 and by the Czech Grant Agency GAČR 405/96/K214.
¹ With the exception of TGTSs for coordinated constructions, see below.
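
The restoration rule described above (an added node is indexed ELEX or ELID and is attached to the left of its governor) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the class `TGTSNode`, the function `restore_node`, the field names, and the example lemmas are hypothetical and do not reflect the PDT file format or the annotators' tools. Coordinated constructions, which the text names as an exception, are not handled.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TGTSNode:
    """Hypothetical node of a tectogrammatical tree (not the PDT format)."""
    lemma: str
    # None for nodes present on the surface; "ELEX" or "ELID" for restored nodes.
    index: Optional[str] = None
    # Dependents to the left and to the right of this node.
    left: List["TGTSNode"] = field(default_factory=list)
    right: List["TGTSNode"] = field(default_factory=list)


def restore_node(governor: TGTSNode, lemma: str,
                 antecedent_is_expanded_head: bool) -> TGTSNode:
    """Add a restored node under `governor`.

    The index is ELEX if the antecedent is an expanded head node and ELID
    otherwise. The node is attached on the governor's left-hand side; its
    position among other left dependents is an assumption of this sketch.
    """
    index = "ELEX" if antecedent_is_expanded_head else "ELID"
    node = TGTSNode(lemma=lemma, index=index)
    governor.left.append(node)
    return node


# Hypothetical example: restoring a deleted subject under a verb.
verb = TGTSNode("come")
subject = restore_node(verb, "he", antecedent_is_expanded_head=False)
print(subject.index, [n.lemma for n in verb.left])
```

Keeping left and right dependents in separate lists makes the left-hand attachment of restored nodes explicit; a single ordered list with a position attribute would work equally well.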

Similar resources

Spanish Phoneme Classification by Means of a Hierarchy of Kohonen Self-Organizing Maps

Research Issues for the Next Generation Spoken Dialogue Systems p. 1 Data-Driven Analysis of Speech p. 10 Towards a Road Map for Machine Translation Research p. 19 The Prague Dependency Treebank: Crossing the Sentence Boundary p. 20 Text Tiered Tagging and Combined Language Models Classifiers p. 28 Syntactic Tagging p. 34 Information, Language, Corpus and Linguistics p. 39 Prague Dependency Tre...

Difference between Written and Spoken Czech: The Case of Verbal Nouns Denoting an Action

The present paper extends understanding of differences in expressing actions by verbal nouns in corpora of written vs. spoken Czech, namely in the Czech part of the Prague Czech-English Dependency Treebank and in the Prague Dependency Treebank of Spoken Czech. We show that while the written corpus includes more complex noun phrases with more explicit expression of adnominal participants, noun ph...

Complex Corpus Annotation: The Prague Dependency Treebank

The Prague Dependency Treebank (Hajič et al., 2001) is approaching the publication of its second version in which the tectogrammatical annotation is being added to the morphological and analytical (surface-syntactic) one. In this article, the Prague Dependency Treebank as a whole is being described, including its brief history. In this volume, there are three more papers with a detailed account...

Learning to Search in Prague Dependency Treebank

We present Netgraph – an easy to use tool for searching in linguistically annotated treebanks. On several examples from the Prague Dependency Treebank we introduce the features of the searching language and show how to search for some frequent linguistic phenomena.

The Theory of Control Applied to the Prague Dependency Treebank (PDT)

One of the most difficult issues within corpora annotation on an underlying syntactic level is the restoration of nodes omitted in the surface shape of the sentence, but present on the “underlying” or “deep” syntactic level. In the present paper we concentrate on such type of nodes which are omitted due to the phenomenon usually called grammatical “control” with regard to their respective anaph...

Publication date: 1999